Summary

Invariant coordinate selection (ICS) and projection pursuit (PP) 
are two methods that can be used to detect clustering directions in 
multivariate data by optimizing criteria sensitive to non-normality. 

In particular, ICS finds clustering directions using a relative
eigen-decomposition of two scatter matrices with different levels of
robustness; PP is a one-dimensional variant of ICS. Each of the two scatter
matrices includes an implicit or explicit choice of location. However, 
when different measures of location are used, ICS and PP can behave 
counter-intuitively. In this paper we explore this behavior in a variety 
of examples and propose a simple and natural solution: use the same 
measure of location for both scatter matrices. 

Keywords: Cluster analysis; Invariant coordinate selection; Projection
pursuit; Robust scatter matrices; Location measures; Multivariate mixture model.

1 Introduction 

Consider a multivariate dataset, given as an n x p data matrix X, and suppose
we want to explore the existence of any clusters. One way to detect clusters
is by projecting the data onto a lower dimensional subspace for which the
data are maximally non-normal. Hence, methods that are sensitive to
non-normality can be used to detect clusters.

One set of methods based on this principle is invariant coordinate selection
(ICS), introduced by Tyler et al. (2009), together with a one-dimensional
variant called projection pursuit (PP), introduced by Friedman and Tukey
(1974). ICS involves the use of two scatter matrices, $S_1 = S_1(X)$ and
$S_2 = S_2(X)$, with $S_2$ chosen to be more robust than $S_1$. An
eigen-decomposition of $S_2^{-1} S_1$ is carried out. If the data can be
partitioned into two clusters, then typically the eigenvector corresponding
to the smallest eigenvalue is a good estimate of the clustering direction.
The main choice for the user when carrying out ICS is the choice of the two
scatter matrices.

However, in numerical experiments based on a simple mixture of two bivariate
normal distributions, some strange behaviour was noticed. In certain
circumstances, ICS, and its variant PP, badly failed to pick out the right 
clustering direction. Eventually, it was discovered that the cause was the use 
of different location measures in the two scatter matrices. The purpose of 
this paper is to explore the reasons for this strange behaviour in detail and 
to demonstrate the benefits of using common location measures. 

Section 2 gives some examples of scatter matrices and reviews the use of
ICS and PP as clustering methods. Section 3 sets out the multivariate normal
mixture model with two useful standardizations of the coordinate system.
Section 4 demonstrates in the population setting an ideal situation where
ICS and PP work as expected and where an analytic solution is available:
the two-group normal mixture model where the two scatter matrices are
given by the covariance matrix and a kurtosis-based matrix. Some examples
with other robust estimators are given in Sections 5-6, which show how ICS
and PP can go wrong when different location measures are used and how the
problem is fixed by using a common location measure.

Notation. Univariate random variables, and their realizations, are denoted
by lowercase letters, x, say. Multivariate random vectors, and their
realizations, are denoted by lowercase bold letters, x, say. A capital letter,
X, say, is used for an n x p data matrix containing p variables or
measurements on n observations; X can be written in terms of its rows as

$$X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix},$$

with ith row $x_i^T = (x_{i1}, \ldots, x_{ip})$, $i = 1, \ldots, n$.

2 Background

2.1 Scatter matrices 


A scatter matrix S(X), as a function of an n x p data matrix X is a p x p 
affine equivariant positive definite matrix. Following Tyler et al. (2009), it is 
convenient to classify scatter matrices into three classes depending on their 
robustness. 


(1) Class I: the class of non-robust scatter matrices with zero breakdown
point and unbounded influence function. Examples include the
covariance matrix defined below in (1) and the kurtosis-based matrix
in (2).

(2) Class II: the class of scatter matrices that are locally robust, in the
sense that they have bounded influence function and positive breakdown
points not greater than $1/(p+1)$. An example from this class is the
class of multivariate M-estimators, such as the M-estimate for the
t-distribution (Arslan et al., 1995; Kent et al., 1994).

(3) Class III: the class of scatter matrices with high breakdown points
such as the Stahel-Donoho estimate, the minimum volume ellipsoid
(mve) (Van Aelst and Rousseeuw, 2009) and the constrained M-estimates
(e.g., Kent and Tyler, 1996).


Each scatter matrix has an implicit location measure. Let us look at the
main examples in more detail, and note what happens in p = 1 dimension.
The labels in parentheses are used as part of the notation later in the paper.
The sample covariance matrix (var) is defined by

$$S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T, \qquad (1)$$

where for convenience here a divisor of 1/n is used, and where $\bar{x}$ is
the sample mean vector. The implicit measure of location is just the sample
mean.

The kurtosis-based matrix (kmat) is defined by

$$K = \frac{1}{n} \sum_{i=1}^{n} \{(x_i - \bar{x})^T S^{-1} (x_i - \bar{x})\} (x_i - \bar{x})(x_i - \bar{x})^T. \qquad (2)$$

Note that outlying observations are given higher weight than for the
covariance matrix, so that $K$ is less robust than $S$. Again the implicit
measure of location is just the sample mean. When p = 1, the scatter matrix
$S^{-1} K$ reduces to 3 plus the usual univariate kurtosis.
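The p = 1 remark can be checked numerically. The following is a minimal sketch, not code from the paper: the function name `kmat_ratio_1d` is ours, and it simply evaluates (1) and (2) for a univariate sample with the 1/n divisor.

```python
def kmat_ratio_1d(xs):
    """Return S^{-1} K for a univariate sample, using the 1/n divisor of (1) and (2)."""
    n = len(xs)
    mean = sum(xs) / n
    dev = [x - mean for x in xs]
    s = sum(d * d for d in dev) / n                  # S in (1) when p = 1
    k = sum((d * d / s) * d * d for d in dev) / n    # K in (2) when p = 1
    return k / s                                     # fourth moment / variance^2


# A symmetric two-point sample has kurtosis -2, so the ratio is 3 + (-2) = 1.
two_point = [-1.0, 1.0, -1.0, 1.0]
ratio = kmat_ratio_1d(two_point)
```

The two-point sample is the most extreme sub-Gaussian case, which is why it attains the lower bound of the kurtosis.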

The M-estimator of scatter based on the multivariate t-distribution for
fixed $\nu$ is the maximum likelihood estimate obtained by maximizing the
likelihood jointly over scatter matrix $\Sigma$ and location vector $\mu$.
If both parameters are unknown and $\nu > 1$, then under mild conditions on
the data, the mle of $(\mu, \Sigma)$ is the unique stationary point of the
likelihood. Similarly, if $\nu > 0$ and $\mu$ is known, the mle of $\Sigma$
is the unique stationary point of the likelihood (Kent et al., 1994). In
either case, an iterative numerical algorithm is needed. Note that when
$\mu$ is to be estimated as well as $\Sigma$, the mle of $\mu$ is the
implicit measure of location for this scatter matrix. For this paper we
limit attention to the choice $\nu = 2$ (and label it below by t2).

The minimum volume ellipsoid (mve) estimate of scatter $S_{mve}$, introduced
by Rousseeuw (1985), is defined by the ellipsoid that has the minimum volume
among all ellipsoids containing at least half of the observations, and its
implicit estimate of location, $\bar{x}_{mve}$, say, is the centre of that
ellipsoid. Calculating the exact mve requires extensive computation. In
practice, it is calculated approximately by considering only a subset of all
subsamples that contain 50% of the observations (e.g., Maronna et al., 2006;
Van Aelst and Rousseeuw, 2009). If the location vector is specified, the
search is limited to ellipsoids centred at this location measure.

When p = 1, the mve reduces to the lshorth, defined as the length of
the shortest interval that contains at least half of the observations. The
corresponding estimate of location, $\bar{x}_{lshorth}$, say, is the midpoint
of this interval. Calculating the lshorth around a known measure of location
is trivial; just find the length of the interval, centered at this location
measure, that contains half of the observations. The lshorth was introduced
by Grubel (1988), building on an earlier suggestion of Andrews et al. (1972)
to use $\bar{x}_{lshorth}$, which they called the shorth, as a location
measure.
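The two versions of the lshorth can be sketched as follows. This is an illustrative sketch only: the function names are ours, and "half of the observations" is taken to mean h = ceil(n/2) points, which is an assumption (conventions differ).

```python
import math


def lshorth(xs):
    """Length and midpoint of the shortest interval holding h = ceil(n/2) points."""
    x = sorted(xs)
    n = len(x)
    h = math.ceil(n / 2)  # assumed meaning of "at least half"
    # Slide a window of h consecutive order statistics and keep the shortest.
    start = min(range(n - h + 1), key=lambda i: x[i + h - 1] - x[i])
    lo, hi = x[start], x[start + h - 1]
    return hi - lo, (lo + hi) / 2


def lshorth_about(xs, loc):
    """Length of the interval centred at loc that holds h = ceil(n/2) points."""
    h = math.ceil(len(xs) / 2)
    radius = sorted(abs(v - loc) for v in xs)[h - 1]
    return 2 * radius


data = [0, 1, 2, 10, 11, 12, 13]
length, midpoint = lshorth(data)                          # locks on to the larger group
length_mean = lshorth_about(data, sum(data) / len(data))  # centred at the sample mean
```

In this toy sample the unconstrained interval sits on the larger group (midpoint 11.5), whereas the interval centred at the sample mean 7 has to span both groups and is much longer.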

The minimum covariance determinant (mcd) estimate of scatter, $S_{mcd}$, is
defined as the covariance matrix of the half of the observations with the
smallest determinant. The mcd location measure, $\bar{x}_{mcd}$, say, is the
sample mean of those observations. The mcd can be calculated approximately
by considering only a subset of all subsamples that contain at least half of
the observations (e.g., Rousseeuw and Driessen, 1999). The mcd estimate of
scatter with respect to a known location measure $\mu$ is defined as the
covariance matrix about $\mu$ of the half of the observations with the
smallest determinant. Recall that the covariance matrix about $\mu$ for a
dataset is given by $S + (\mu - \bar{x})(\mu - \bar{x})^T$, where $S$ and
$\bar{x}$ are the sample covariance matrix and mean vector of the dataset.

When p = 1, the mcd reduces to a truncated variance, $v_{trunc}$, say,
defined as the smallest variance of half the observations. Its implicit
measure of location, $\bar{x}_{trunc}$, say, is the sample mean of those
observations. Also, a modified definition of $v_{trunc}$ using a known
location measure is trivial and does not require any search; just find the
interval, centered at the given location measure, that contains half of the
observations and calculate the variance.

Routines are available in R (R Core Team, 2014) to compute (at least
approximately) these robust covariance matrices and their implicit location
measures, in particular, tM from the package ICS (Nordhausen et al., 2008)
for the multivariate t-distribution, cov.rob from the package MASS (Venables
and Ripley, 2002) for mve, and CovMcd from the package rrcov (Todorov
and Filzmoser, 2009) for mcd. Modified versions of these routines have been
written by us to deal with the case of known location measures.


2.2 Invariant coordinate selection and projection pursuit


Given an n x p data matrix X, the ICS objective function is given by the
ratio of quadratic forms

$$\kappa_{ICS}(a) = \frac{a^T S_1 a}{a^T S_2 a}, \qquad (3)$$

where $S_1 = S_1(X)$ and $S_2 = S_2(X)$ are two scatter matrices. By
convention, $S_2$ is chosen to be more robust than $S_1$. For exploratory
statistical analysis, attention is focused on the choices for $a$ maximizing
or minimizing $\kappa_{ICS}(a)$. These choices can be calculated analytically
as the eigenvectors of $S_2^{-1} S_1$ corresponding to the maximum/minimum
eigenvalues.
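For p = 2 the eigen-decomposition can be written out by hand. The sketch below is ours and purely illustrative (it is not the ICS package routine): it solves det(S1 - lambda S2) = 0 as a quadratic and reads each eigenvector off the singular matrix S1 - lambda S2. The function name and the nested-list matrix format are assumptions made for the example.

```python
import math


def ics_directions_2d(s1, s2):
    """Eigenvalues and unit eigenvectors of S2^{-1} S1 for symmetric 2 x 2 inputs.

    Matrices are nested lists [[a, b], [b, c]]; s2 must be positive definite.
    Returns [(eigenvalue, (a1, a2)), ...] sorted by increasing eigenvalue.
    """
    (p11, p12), (_, p22) = s1
    (r11, r12), (_, r22) = s2
    # det(S1 - lam * S2) = 0 is the quadratic qa*lam^2 - qb*lam + qc = 0.
    qa = r11 * r22 - r12 * r12
    qb = p11 * r22 + p22 * r11 - 2 * p12 * r12
    qc = p11 * p22 - p12 * p12
    disc = math.sqrt(max(qb * qb - 4 * qa * qc, 0.0))
    out = []
    for lam in ((qb - disc) / (2 * qa), (qb + disc) / (2 * qa)):
        # Rows of M = S1 - lam * S2; an eigenvector is orthogonal to each row.
        rows = [(p11 - lam * r11, p12 - lam * r12),
                (p12 - lam * r12, p22 - lam * r22)]
        m1, m2 = max(rows, key=lambda r: r[0] ** 2 + r[1] ** 2)
        norm = math.hypot(m1, m2)
        # If M vanishes (equal eigenvalues) every direction is an eigenvector.
        vec = (-m2 / norm, m1 / norm) if norm > 0 else (1.0, 0.0)
        out.append((lam, vec))
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
(lam_min, a_min), (lam_max, a_max) = ics_directions_2d(identity, [[4.0, 0.0], [0.0, 1.0]])
```

Here $S_2$ has its largest spread along the first axis, so the minimum of $\kappa_{ICS}$ is attained at $a = \pm e_1$ with eigenvalue 1/4. In higher dimensions a generalized symmetric eigen-solver would be used instead.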

The original ICS method did not make a strong distinction between the
largest and the smallest eigenvalues. However, for clustering purposes between
two groups, when the mixing proportion is not too far from 1/2, it is the
minimum eigenvalue which is of interest; see Section 4.

The method of PP can be regarded as a one-dimensional version of ICS.
It looks for a linear projection $a$ to maximize or minimize the criterion,

$$\kappa_{PP}(a) = \frac{s_1(Xa)}{s_2(Xa)}, \qquad (4)$$

where $s_1 = s_1(Xa)$ and $s_2 = s_2(Xa)$ are two one-dimensional measures
of spread. In general, optimizing $\kappa_{PP}(a)$ must be carried out
numerically. Searching for a global optimum is computationally expensive,
and the complexity of the search increases as the dimension p increases.
Alternatively, we can search for a local optimum starting from a sensible
initial solution, such as the ICS optimum direction.

Both ICS and PP are equivariant under affine transformations. That is,
if X is transformed to $U = 1_n h^T + X Q^T$, where $Q$ (p x p) is
nonsingular and $h$ is a translation vector in $\mathbb{R}^p$, then for
either ICS or PP the new optimal vector $b$, say, for $U$ is related to the
corresponding optimal vector $a$ for $X$ by

$$b \propto Q^{-T} a. \qquad (5)$$

For numerical work it is convenient to have an explicit notation for the
different choices in ICS and PP. If Scat1 and Scat2 are the names of two types
of multivariate scatter matrix, each computed with its own implicit location
measure, then the corresponding versions of ICS and PP will be denoted

ICS:Scat1:Scat2, and PP:Scat1:Scat2.

Note that PP is based on the univariate versions of Scat1 and Scat2. For
example, ICS based on the covariance matrix and the minimum volume
ellipsoid will be denoted by ICS:var:mve. Other choices for scatter matrices
have been summarized in Section 2.1.

When a common location measure is imposed on Scat1 and Scat2, then
this restriction will be indicated by the augmented notation

ICS:Scat1:Scat2:Loc,

and similarly for PP. In this paper the only choice used for the location
measure is the sample mean (mean). For example, ICS based on the covariance
matrix and the minimum volume ellipsoid, both computed with respect to
the mean vector, is denoted

ICS:var:mve:mean.

3 The two-group multivariate normal mixture model

The simple model used to demonstrate the main points of this paper is the
two-group multivariate normal mixture model, with density

$$f(x) = q\,\phi_p(x; \mu_1, \Omega) + (1 - q)\,\phi_p(x; \mu_2, \Omega),$$

where $\phi_p$ is the multivariate normal density, $\mu_1$ and $\mu_2$ are
two mean vectors, $\Omega$ is a common covariance matrix, and $0 < q < 1$ is
the mixing proportion. Even in this simple case, major problems with ICS and
PP can arise.

Since ICS and PP are affine equivariant, we may without loss of generality
choose the coordinate system so that

$$\mu_1 = \alpha e_1, \quad \mu_2 = -\alpha e_1, \quad \Omega = I_p,$$

where $e_1 = (1, 0, \ldots, 0)^T$ is a unit vector along the first coordinate
axis, and $\alpha > 0$. That is, $\mu_1$ and $\mu_2$ lie equally spaced about
the origin along the first coordinate axis, and the covariance matrix of each
component equals the identity matrix.

A random vector $x$ from the mixture model can also be given a stochastic
representation,

$$x = \alpha s e_1 + \epsilon,$$

where $\epsilon \sim N_p(0, I_p)$ independently of an indicator variable $s$,

$$s = \begin{cases} 1 & \text{with probability } q, \\ -1 & \text{with probability } 1 - q. \end{cases}$$

Moments under the mixture model are calculated most simply in terms of
this stochastic representation. In particular,

$$\mu_x = E(x) = q\mu_1 + (1 - q)\mu_2 = (2q - 1)\alpha e_1, \qquad E(xx^T) = \alpha^2 e_1 e_1^T + I_p,$$

so that the covariance matrix is

$$\Sigma_x = \mathrm{var}(x) = E(xx^T) - \mu_x \mu_x^T = 4q(1 - q)\alpha^2 e_1 e_1^T + I_p. \qquad (6)$$

For practical work it is also convenient to consider a standardization for
which the overall covariance matrix is the identity matrix. That is, define a
new random vector

$$y = C^{-1} x, \qquad (7)$$

where $C^{-1} = \mathrm{diag}(1/c_1, \ldots, 1/c_p)$, with
$c_1 = \{1 + 4q(1 - q)\alpha^2\}^{1/2}$ and $c_2 = \cdots = c_p = 1$. Then
$y$ has a stochastic representation

$$y = \delta s e_1 + \eta,$$

where

$$\delta = \alpha / \{1 + 4q(1 - q)\alpha^2\}^{1/2},$$

and

$$\eta \sim N_p(0, \mathrm{diag}(\sigma_\eta^2, 1, \ldots, 1)), \qquad (8)$$

where the first diagonal term $\sigma_\eta^2$ has two equivalent formulas,

$$\sigma_\eta^2 = \{1 + 4\alpha^2 q(1 - q)\}^{-1} \quad \text{or} \quad \sigma_\eta^2 = 1 - 4q(1 - q)\delta^2.$$

The first two moments of $y$ are

$$\mu_y = (2q - 1)\delta e_1, \qquad \Sigma_y = I_p.$$
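The standardization constants are easy to evaluate for given $\alpha$ and $q$. The helper below is a small sketch of ours (the function name is hypothetical); it also checks that the two formulas for $\sigma_\eta^2$ agree.

```python
import math


def standardized_parameters(alpha, q):
    """Return (c1, delta, sigma_eta_sq) for the two-group mixture of Section 3."""
    spread = 4 * q * (1 - q) * alpha ** 2    # between-group variance along e_1
    c1 = math.sqrt(1 + spread)               # scale factor in (7)
    delta = alpha / c1                       # standardized half-separation
    sigma_eta_sq = 1 / (1 + spread)          # first formula for var(eta_1)
    # The second formula must give the same value.
    second = 1 - 4 * q * (1 - q) * delta ** 2
    assert math.isclose(sigma_eta_sq, second, rel_tol=1e-9, abs_tol=1e-12)
    return c1, delta, sigma_eta_sq


c1, delta, sigma_eta_sq = standardized_parameters(alpha=3.0, q=0.5)
```

For $\alpha = 3$ and $q = 1/2$ this gives $c_1 = \sqrt{10}$, $\delta = 3/\sqrt{10} \approx 0.95$ and $\sigma_\eta^2 = 0.1$, the values used in the examples of Section 6.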
4 A population example: PP based on the kurtosis and ICS based on the
kurtosis-based matrix and the covariance matrix


In this section we look at ICS:kmat:var and PP:kmat:var in the population
case. In this setting it is possible to derive analytic results. Note that since
kmat is based on fourth moments it is less robust than the variance matrix;
hence kmat is listed first.

Recall the kurtosis of a univariate random variable $u$, say, with mean
$\mu_u$, is defined by

$$\mathrm{kurt}(u) = \frac{E\{(u - \mu_u)^4\}}{[E\{(u - \mu_u)^2\}]^2} - 3.$$

The univariate kurtosis is zero when the random variable has a normal
distribution. For non-normal distributions the kurtosis lies in the interval
$[-2, \infty)$ and is often nonzero.


Pena and Prieto (2001) studied the population version of PP:kmat:var
and showed that when the mixing proportion is not too far from 1/2 (more
precisely, if $q(1 - q) > 1/6$, i.e. $0.21 < q < 0.79$), then minimizing the
PP objective function picks out the correct clustering direction.

Their result can be derived simply as follows. Let $a \in \mathbb{R}^p$ be a
unit vector. Write $a^T x = \alpha a_1 s + v$, where
$v = a^T \epsilon \sim N(0, 1)$ is independent of $s$. The moments of $s$ are
$E(s) = E(s^3) = m$, say, where

$$m = 2q - 1, \qquad (9)$$

and $E(s^2) = E(s^4) = 1$. Hence, $\mathrm{var}(s) = \sigma^2$, say, where

$$\sigma^2 = 4q(1 - q). \qquad (10)$$

Then

$$\mathrm{kurt}(s) = -6 + 4/\sigma^2.$$

It can be checked that $\mathrm{kurt}(s) < 0$ provided $q(1 - q) > 1/6$.

Next, we use the property that if $u_1, u_2$ are independent random variables
with the same variance, and if $\delta_1, \delta_2$ are coefficients
satisfying $\delta_1^2 + \delta_2^2 = 1$, then

$$\mathrm{kurt}(\delta_1 u_1 + \delta_2 u_2) = \delta_1^4\,\mathrm{kurt}(u_1) + \delta_2^4\,\mathrm{kurt}(u_2).$$

Applying this result to $a^T x$ yields

$$\mathrm{kurt}(a^T x) = \frac{\alpha^4 a_1^4 \sigma^4}{(\alpha^2 a_1^2 \sigma^2 + 1)^2}\,\mathrm{kurt}(s). \qquad (11)$$

Provided $\mathrm{kurt}(s) < 0$, (11) is minimized when $a_1^2$ is maximized,
that is, if $a_1^2 = 1$, so that $a = \pm e_1$ picks out the first coordinate
axis.
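Formula (11) can be explored numerically. The sketch below is ours (function names are hypothetical); it evaluates the population kurtosis of the projection for directions $a$ with $a_1 = \cos\theta$ and scans a grid of angles for the minimizer.

```python
import math


def kurt_s(q):
    """Kurtosis of the +/-1 indicator s with P(s = 1) = q."""
    sigma_sq = 4 * q * (1 - q)               # var(s), equation (10)
    return -6 + 4 / sigma_sq


def kurt_projection(theta, alpha, q):
    """Kurtosis of a^T x from (11) for a unit vector a with a_1 = cos(theta)."""
    sigma_sq = 4 * q * (1 - q)
    t = alpha ** 2 * math.cos(theta) ** 2 * sigma_sq   # alpha^2 a_1^2 sigma^2
    return t ** 2 / (t + 1) ** 2 * kurt_s(q)


# Scan directions in one-degree steps for alpha = 3, q = 1/2.
angles = [math.radians(deg) for deg in range(-90, 91)]
best = min(angles, key=lambda th: kurt_projection(th, 3.0, 0.5))
```

With $q = 1/2$ the minimizer is $\theta = 0$, the clustering direction. For an unbalanced mixture such as $q = 0.1$, kurt(s) is positive, and the same scan would have to look for a maximum instead.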

The ICS calculations proceed similarly. First note that the first diagonal
term in $\Sigma_x$, defined in (6), can be expressed in terms of $\sigma^2$,
defined in (10), as $\alpha^2\sigma^2 + 1$.

The first factor in the population version of $K$ defined in (2), $K_x$, say,
is given by

$$(x - \mu_x)^T \Sigma_x^{-1} (x - \mu_x) = \frac{(x_1 - \alpha m)^2}{1 + \alpha^2\sigma^2} + x_2^2 + \cdots + x_p^2 = D^2, \text{ say},$$

where $m$ is defined in (9). Note that $D^2$ is an even function in
$x_2, \ldots, x_p$. Hence by symmetry all the off-diagonal terms in $K_x$
vanish. The first diagonal term is given by

$$E\{D^2 (x_1 - \alpha m)^2\} = (1 + \alpha^2\sigma^2)(p + 2) + \frac{\alpha^4\sigma^4\,\mathrm{kurt}(s)}{1 + \alpha^2\sigma^2}.$$

The remaining diagonal terms, $j = 2, \ldots, p$, are given by

$$E\{D^2 x_j^2\} = p + 2.$$

Hence $\Sigma_x^{-1} K_x$ reduces to

$$\mathrm{diag}\left(p + 2 + \frac{\mathrm{kurt}(s)\,\alpha^4\sigma^4}{(1 + \alpha^2\sigma^2)^2},\; p + 2, \ldots, p + 2\right).$$

These diagonal values are the eigenvalues. Hence provided
$\mathrm{kurt}(s) < 0$, $\kappa_{ICS}$ is minimized when $a = e_1$; that is,
when $a$ picks out the clustering direction.
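As a worked illustration (our own arithmetic, under the formulas above), take p = 2, $\alpha = 3$ and $q = 1/2$. Then $\sigma^2 = 1$ and $\mathrm{kurt}(s) = -2$, so the eigenvalues of $\Sigma_x^{-1} K_x$ are:

```latex
\lambda_1 = p + 2 + \frac{\mathrm{kurt}(s)\,\alpha^4\sigma^4}{(1 + \alpha^2\sigma^2)^2}
          = 4 - \frac{2 \cdot 81}{100} = 2.38,
\qquad
\lambda_2 = p + 2 = 4 .
```

The smaller eigenvalue belongs to $e_1$, so in this case the minimum of $\kappa_{ICS}$ is attained in the clustering direction, as claimed.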

If p = 2, we can write a unit vector as $a = (\cos\theta, \sin\theta)^T$, and
since $a$ and $-a$ define the same axis, we can parameterize the ICS and PP
objective functions in terms of $\theta$, $-\pi/2 < \theta < \pi/2$. Plots of
$\kappa_{ICS}(\theta)$ and $\kappa_{PP}(\theta)$ for $\alpha = 3$ and
$q = 1/2$ are shown in Figure 1.

Figure 1: Plot of the population criteria $\kappa_{ICS}(\theta)$ (red dotted
line), and $\kappa_{PP}(\theta)$ (solid black line) versus $\theta$, for
$q = 1/2$, $\alpha = 3$.

For numerical work, especially when the underlying mixture model is
unknown, the only feasible standardization is to ensure the overall variance
matrix $\Sigma_y$ is the identity rather than the within-group variance
matrix. In terms of the population model of this section, it means working
with $y$ from (7) rather than $x$. If p = 2 and
$b \propto (\cos\phi, \sin\phi)^T$, say, is also written in polar
coordinates, then from (5) and (7) $a$ and $b$ are related by

$$b \propto C a,$$

hence, $\phi$ and $\theta$ are related by

$$\begin{pmatrix} \cos\phi \\ \sin\phi \end{pmatrix} \propto \begin{pmatrix} c_1 & 0 \\ 0 & c_2 \end{pmatrix} \begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix}.$$

Thus,

$$\tan\phi = c \tan\theta,$$

where $c = c_2 / c_1$.

The plot of the ICS and PP objective functions in Figure 2 shows that
there is a sharper minimum in $\phi$ coordinates than in $\theta$ coordinates
because under our mixture model $c$ is less than 1. If $x$ is scaled as in
(7) with $c_1 < c_2$, i.e. $c > 1$, then there will be a wider minimum in
$\phi$.

Figure 2: Plot of the population criteria $\kappa_{ICS}(\phi)$ (red dotted
line), and $\kappa_{PP}(\phi)$ (solid black line) versus $\phi$, for
$q = 1/2$ and $\delta = 0.95$.
5 The effect of using a common location measure on ICS and PP


As we mentioned earlier in Section 2.2, the ICS and PP criteria are expected
to have similar behaviour to the kurtosis-based criteria in Section 4. Namely,
they are expected to be minimized in the clustering direction when the mixing
proportion is not too far from 1/2.

However, when applying ICS with at least one robust estimate of scatter 
(mainly from Class III), some peculiar behaviour was observed. In particular, 
the ICS criterion was often maximized in the clustering direction rather than 
minimized. 

Here is an explanation. Under the two-group mixture model with one
group slightly bigger than the other, a Class III scatter matrix will typically
home in on the larger group, with its corresponding location measure at the
center of this group and its estimate of the scatter matrix capturing the
spread of this group. The other scatter matrix (Class I or II) will measure
the overall scatter of the data with its corresponding location measure at the
overall center of the data. The result is erratic behaviour in $\kappa_{ICS}$
and $\kappa_{PP}$.

Imposing a common location measure on the two scatter matrices fixes
this problem. Here is an example in p = 2 dimensions to illustrate the issues
in greater detail.

In this example we look at ICS:var:mve for the population bivariate normal
mixture model in Section 3 with $q = 1/2$ and any value of $\alpha > 0$, i.e.
$0 < \delta < 1$, where $\delta$ is given in (8). Standardize the coordinate
system so that the overall covariance matrix is the identity,
$\Sigma_y = I_2$. Let $\Sigma_{mve}$ denote the population minimum volume
ellipsoid scatter matrix.

Then it turns out that $\Sigma_{mve}$ is the within-group covariance matrix
for (either) one of the groups,

$$\Sigma_{mve} = \begin{pmatrix} 1 - \delta^2 & 0 \\ 0 & 1 \end{pmatrix}, \qquad (12)$$

where $0 < \delta < 1$ is given in (8). The implicit estimate of the center
of the data will be given by the center of either group, $\pm\delta e_1$;
both values fit equally well.

Figure 3 shows that the clustering direction estimated by the ICS:var:mve
method is the direction that minimizes $\kappa_{ICS}$ (the eigenvector of the
smallest eigenvalue of $\Sigma_{mve}^{-1} \Sigma_y$), namely $(0, 1)^T$, i.e.
$\phi = \pm\pi/2$. However, the true direction of group separation is
$(1, 0)^T$, i.e. $\phi = 0$.
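The failure can be reproduced directly from (12). The sketch below is ours (the function name is hypothetical); it evaluates the population criterion $a^T \Sigma_y a / a^T \Sigma_{mve} a$ with $\Sigma_y = I_2$ for $a = (\cos\phi, \sin\phi)^T$.

```python
import math


def kappa_ics_var_mve(phi, delta):
    """Population ICS:var:mve criterion with Sigma_y = I_2 and Sigma_mve from (12)."""
    c, s = math.cos(phi), math.sin(phi)
    return 1.0 / ((1 - delta ** 2) * c * c + s * s)


delta = 0.9
at_cluster_axis = kappa_ics_var_mve(0.0, delta)         # along e_1, the true direction
at_orthogonal = kappa_ics_var_mve(math.pi / 2, delta)   # along e_2
```

For $\delta = 0.9$ the criterion equals $1/0.19 \approx 5.26$ along the clustering direction and 1 in the orthogonal direction, so it is maximized, not minimized, at $\phi = 0$.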

Figure 3: Plot of the population criterion of ICS:var:mve vs. $\phi$ for
$\delta = 0.9$.

Next consider ICS:var:mve:mean, i.e. the common mean version of the
previous example. The overall mean of the data is at the origin. When
$\Sigma_{mve}$ is constrained to have its location measure at the origin,
the ICS criterion picks out the true clustering direction. In order to give
an analytic proof of this result, we restrict attention to the limiting case
of the balanced mixture model, i.e. when $\delta = 1$, $q = 1/2$. Hence, the
group components will lie on two parallel vertical lines with means

$$\mu_1 = (1, 0)^T, \quad \mu_2 = (-1, 0)^T,$$

and within-group covariance matrix

$$\begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}.$$

In this setting, it can be shown that the population version of the MVE
matrix, $\Sigma_{mve}$, say, takes the form

$$\Sigma_{mve} = \begin{pmatrix} 2 & 0 \\ 0 & 2d^2 \end{pmatrix},$$

where $d = \Phi^{-1}(0.75) = 0.674$, the 75th percentile of the standard
normal distribution (see the Appendix). Hence the dominant eigenvector is
$e_1$. The ellipse of $\Sigma_{mve}$ is plotted in Figure 4. Figure 5 shows
that the criterion of ICS:var:mve:mean, $\kappa_{ICS:\mu}(\phi)$, picks out
the correct clustering direction $e_1$.

Like ICS, PP can fail to detect the clustering direction if applied using
different location measures. The reason is that, for a projection direction
that separates the data into two groups with one slightly bigger than the
other, the more robust measure of spread will be located at the larger group.
In Section 6 we give a detailed numerical example of the problem arising from
using two different location measures in PP:var:mcd, and how the problem
is fixed by using a common location measure.

Figure 4: Plot of the ellipse of the constrained $\Sigma_{mve}$.

Figure 5: Plot of the population criterion of ICS:var:mve:mean,
$\kappa_{ICS:\mu}(\phi)$.

6 Examples 

Overview 

In this section, we give numerical examples that demonstrate different ways
in which ICS and/or PP can go wrong. We also show the effect of using
common location measures in these examples. We use one simulated data
set and apply different ICS and PP methods, with and without imposing a
common location measure (the mean).

A two-dimensional data set of size n = 500 is generated from the balanced
mixture model, defined in Section 3, with $q = 1/2$ and $\alpha = 3$, so that
$\delta = 0.95$. Thus the two groups are well-separated and no sensible
statistical method should have any problem finding the two clusters. All
calculations are done after standardization with respect to the “total”
coordinates. That is, the data matrix $Y$ (500 x 2) is standardized to have
sample mean 0 and sample covariance matrix $I_2$.

The ICS and PP methods used are:

(1) (PP,ICS):var:t2 with corresponding criteria $\kappa^1_{ICS}$ and $\kappa^1_{PP}$.

(2) (PP,ICS):var:mcd with corresponding criteria $\kappa^2_{ICS}$ and $\kappa^2_{PP}$.

(3) (PP,ICS):var:mve with corresponding criteria $\kappa^3_{ICS}$ and $\kappa^3_{PP}$.

(4) (PP,ICS):t2:mcd with corresponding criteria $\kappa^4_{ICS}$ and $\kappa^4_{PP}$.

(5) (PP,ICS):t2:mve with corresponding criteria $\kappa^5_{ICS}$ and $\kappa^5_{PP}$.

When imposing the mean as the common location measure, the ICS and PP
criteria will be denoted by $\kappa^j_{ICS:mean}$ and $\kappa^j_{PP:mean}$,
where $j = 1, \ldots, 5$.

To understand the behaviour of ICS and PP, their criteria are plotted
against $-\pi/2 < \phi < \pi/2$. The plots are shown in Figure 6.
Figure 6: For $\delta = 0.95$ and $q = 1/2$, plots of different ICS (red
dashed curve) and PP (black solid curve) criteria without (left) and with
(right) imposing a common location measure. Panels: (a) (PP,ICS):var:t2;
(b) (PP,ICS):var:t2:mean; (c) (PP,ICS):var:mcd; (d) (PP,ICS):var:mcd:mean;
(e) (PP,ICS):var:mve; (f) (PP,ICS):var:mve:mean; (g) (PP,ICS):t2:mcd;
(h) (PP,ICS):t2:mcd:mean; (i) (PP,ICS):t2:mve; (j) (PP,ICS):t2:mve:mean.

From the panels in Figure 6, we make the following remarks based on the
simulated data set:
(1) Panel (a) shows that ICS:var:t2 and PP:var:t2 work well since $\bar{y}$
and $\bar{y}_{t2}$ are approximately equal. Hence, imposing a common
location measure has little effect, as shown in (b).

(2) Panels (c), (e), (g), (i) show examples where ICS and/or PP go wrong
because of the difference in the location measures.

(3) Using a common location measure fixes the problem in panel (d) for
(PP,ICS):var:mcd, panel (f) for (PP,ICS):var:mve, and panel (h) for
(PP,ICS):t2:mcd.

(4) From panel (j), using a common location measure in PP:t2:mve:mean
does not seem to work well. The reason might be the unstable
behaviour of the mve and lshorth.

(5) The plots generally suggest that PP will be more accurate than ICS,
since the PP plots are narrower at the clustering direction than the
ICS plots. This property has been confirmed empirically in Alashwali
(2013) for certain multivariate normal mixture models and choices of
scatter matrix.

(6) Similar patterns are seen with most simulated data sets from this
model.


Behaviour of ICS:var:mcd 

To gain a deeper understanding of the behaviour of ICS:var:mcd in panel 6(c)
and the effect of forcing a common location measure on mcd in panel 6(d),
we plot the ellipses of $S_{mcd}$ (with and without imposing a common
location measure) and superimpose them on the data points of our example.
The plots are shown in panels 7(a) and (b). The behaviour in this example
agrees with the interpretation given for the population example in Section 5.

Figure 7: Plots of the ellipses of the mcd scatter matrix based on (a) the
mcd location measure, and (b) the sample mean, superimposed on data of size
n = 500, distributed as a mixture of two normal distributions.

Behaviour of PP:var:mcd 

The objective function for PP:var:mcd has a similar problem to ICS; it is
maximized rather than minimized near the correct clustering direction.

To understand this behaviour in more detail, we plot in Figure 8
one-dimensional histograms after projections by the following choices for the
angle $\phi$: 0°, 15°, 30°, and 90°. For each histogram, we plot the 50% of
the data that has the smallest variance, and the corresponding location
measure $\bar{x}_{trunc}$. The plots are repeated where the location measure
is constrained at the sample mean $\bar{x} = 0$.
The shape of the histograms depends on the projection direction. Note
that as $v_{trunc}$ gets smaller, the PP criterion $\kappa_{PP}$ gets larger.

(1) The 0° projection produces two widely separated groups, with one group
slightly bigger than the other. In this case, $\bar{x}_{trunc}$ is at the
larger group and $v_{trunc}$ is essentially the variance of this group.
Hence $v_{trunc}$ takes its smallest value and $\kappa_{PP}$ is largest.

(2) The 15° projection produces two slightly separated groups, with
within-group variance larger than in the 0° projection. The value of
$v_{trunc}$ is larger than for 0°.

(3) The 30° projection produces one group, with a pseudo-uniform
distribution. The value of $v_{trunc}$ is larger than for 15°.

(4) The 90° projection produces one normally distributed group. The value
of $v_{trunc}$ becomes small again.

Constraining the location measure to be at the origin fixes the problem. The
value of $v_{trunc}$ steadily decreases from 0° to 90°.
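The qualitative pattern above can be reproduced with a short simulation. This is a minimal sketch under stated assumptions, not the modified R routines used for the paper: the function names are ours, half of the data is taken as h = n // 2 points, the data are generated directly in population-standardized coordinates (so the sample mean and covariance are only approximately 0 and $I_2$), and the seed is arbitrary.

```python
import math
import random


def min_half_variance(u):
    """Unconstrained v_trunc: smallest variance over windows of h = n // 2 sorted values.

    Returns (variance, mean of the selected half).
    """
    x = sorted(u)
    n, h = len(x), len(x) // 2
    best = None
    for i in range(n - h + 1):
        window = x[i:i + h]
        mean = sum(window) / h
        var = sum((v - mean) ** 2 for v in window) / h
        if best is None or var < best[0]:
            best = (var, mean)
    return best


def half_variance_about(u, loc):
    """Constrained v_trunc: second moment about loc of the h = n // 2 values closest to loc."""
    h = len(u) // 2
    closest = sorted((v - loc) ** 2 for v in u)[:h]
    return sum(closest) / h


def simulate(n=500, alpha=3.0, seed=1):
    """Draw from the balanced mixture in population-standardized coordinates."""
    rng = random.Random(seed)
    delta = alpha / math.sqrt(1 + alpha ** 2)   # q = 1/2 in the formula for delta
    sd1 = math.sqrt(1 - delta ** 2)             # within-group sd along e_1
    return [(delta * rng.choice((-1, 1)) + sd1 * rng.gauss(0, 1), rng.gauss(0, 1))
            for _ in range(n)]


data = simulate()
results = {}
for deg in (0, 15, 30, 90):
    phi = math.radians(deg)
    u = [y1 * math.cos(phi) + y2 * math.sin(phi) for y1, y2 in data]
    centre = sum(u) / len(u)                    # common location: the sample mean
    results[deg] = (min_half_variance(u)[0], half_variance_about(u, centre))
```

Each entry of `results` holds the unconstrained and the mean-constrained truncated variance for one projection angle. At 0° the unconstrained value is close to the within-group variance of a single group, whereas the constrained value is much larger because the half-sample around the mean has to reach into both groups.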


Appendix 

Consider the limiting balanced bivariate normal mixture model,

$$y = s e_1 + z e_2,$$

where $s = \pm 1$, each with probability 1/2, independent of
$z \sim N(0, 1)$, and $e_1 = (1, 0)^T$, $e_2 = (0, 1)^T$. This model is
standardized with respect to the “total” coordinates; i.e. $E(y) = 0$ and
$\mathrm{var}(y) = I_2$. The model can also be described in terms of a
mixture of two normal distributions, concentrated on the vertical lines
$y_1 = 1$ and $y_1 = -1$.

In this appendix we shall show that the population version of the mve,
constrained to be centred at the origin, is given by

$$\Sigma_{mve} = \begin{pmatrix} 2 & 0 \\ 0 & 2d^2 \end{pmatrix},$$

where $d = \Phi^{-1}(0.75)$ in terms of the cumulative distribution function
$\Phi$ of the N(0, 1) distribution.

First let $u_1 < u_2$ be two possible values for $y_2$ and consider an
ellipse based on a matrix $\Sigma$, with inverse $\Sigma^{-1} = \Omega$,

$$y^T \Omega y = 1, \qquad (A.1)$$

which intersects the vertical lines at these points,

$$\begin{pmatrix} 1 & u_1 \end{pmatrix} \Omega \begin{pmatrix} 1 \\ u_1 \end{pmatrix} = 1, \qquad \begin{pmatrix} 1 & u_2 \end{pmatrix} \Omega \begin{pmatrix} 1 \\ u_2 \end{pmatrix} = 1. \qquad (A.2)$$


By symmetry the ellipse also intersects the points $(-1, -u_1)^T$ and
$(-1, -u_2)^T$. Note that $\Sigma$ will be a candidate for the mve matrix if
the interior of the ellipse covers 50% of the probability mass, that is,

$$\Phi(u_2) = \Phi(u_1) + 1/2. \qquad (A.3)$$

If $u_1$ and $u_2$ are finite, then necessarily $u_1 < 0$ and $u_2 > 0$.

The proof will proceed in two stages. First, for fixed $u_1, u_2$ satisfying
(A.3), we choose $\Sigma$ to minimize $\det(\Sigma)$ (or equivalently
maximize $\det(\Omega)$). Secondly, we optimize over the choice of
$u_1, u_2$.

Thus, start with a fixed pair of values $u_1, u_2$ satisfying (A.3). If
$y = (1, u)^T$ represents a point on one of the vertical lines, then the
intersection with the ellipse (A.1) can be written

$$\omega_{11} + 2\omega_{12} u + \omega_{22} u^2 = 1,$$

or equivalently as the quadratic equation in $u$,

$$A u^2 + B u + C = 0,$$

where $A = \omega_{22}$, $B = 2\omega_{12}$, $C = \omega_{11} - 1$. If this
ellipse passes through $(1, u_1)^T$ and $(1, u_2)^T$, then $u_1, u_2$ are
roots of the quadratic equation, so

$$u_1, u_2 = \frac{-B \pm \sqrt{B^2 - 4AC}}{2A}. \qquad (A.4)$$


In particular, setting $M = (u_1 + u_2)/2$ to be the mean of the roots, and
$P = u_1 u_2$ to be the product of the roots, we have

$$M = -\frac{B}{2A} = -\frac{\omega_{12}}{\omega_{22}}, \qquad P = \frac{C}{A} = \frac{\omega_{11} - 1}{\omega_{22}}. \qquad (A.5)$$

Let us try to maximize $\det(\Omega)$ subject to the ellipse satisfying
(A.2). Start with an arbitrary $\omega_{22} > 0$. Then (A.5) determines the
remaining elements of $\Omega$,

$$\omega_{12} = -M\omega_{22}, \qquad \omega_{11} = 1 + P\omega_{22}.$$

Hence

$$\det(\Omega) = \omega_{11}\omega_{22} - \omega_{12}^2 = \omega_{22} - Q\omega_{22}^2,$$

where

$$Q = M^2 - P = \frac{1}{4}(u_1 - u_2)^2 > 0. \qquad (A.6)$$

Maximizing $\det(\Omega)$ with respect to the choice of $\omega_{22}$ leads
to $\omega_{22} = 1/(2Q)$ and

$$\det(\Omega) = 1/(4Q).$$

The remaining task is to choose $u_1 < 0$ (which determines $u_2 > 0$ by
(A.3)) to maximize $\det(\Omega)$, or equivalently, to minimize $Q$ in (A.6).


Recall a basic result from calculus. If $t = f(u)$ and $u = g(t)$ are
monotone functions which are inverse to one another, then $g(f(u)) = u$.
Differentiating twice yields the relation between the derivatives,

$$g' = 1/f', \qquad g'' = -f''/(f')^3.$$

In particular, consider $f(u) = \Phi(u)$, with derivatives
$f'(u) = \phi(u)$ and $f''(u) = -u\phi(u)$, where $\phi(u)$ is the
probability density function of N(0, 1). Then $g(t) = \Phi^{-1}(t)$ with
derivatives $g'(t) = 1/\phi(u)$ and $g''(t) = u/\{\phi(u)\}^2$, where
$u = \Phi^{-1}(t)$.

With this notation, write $u_1 = g(t)$ for $0 < t < 1/2$. Then
$u_2 = g(t + 1/2)$. Write $\phi_1 = \phi(u_1)$, $\phi_2 = \phi(u_2)$. The
quantity $Q$ in (A.6), treated as a function of $t$, has derivatives

$$Q' = \tfrac{1}{2}\{u_1 u_1' - u_1 u_2' - u_1' u_2 + u_2 u_2'\}
     = \tfrac{1}{2}\{u_1(1/\phi_1 - 1/\phi_2) + u_2(1/\phi_2 - 1/\phi_1)\},$$

$$Q'' = \tfrac{1}{2}\{u_1 u_1'' + (u_1')^2 - u_1 u_2'' - 2u_1' u_2' - u_1'' u_2 + u_2 u_2'' + (u_2')^2\}$$

$$= \tfrac{1}{2}\{u_1^2/\phi_1^2 + 1/\phi_1^2 - u_1 u_2/\phi_2^2 - 2/(\phi_1\phi_2) - u_1 u_2/\phi_1^2 + u_2^2/\phi_2^2 + 1/\phi_2^2\}$$

$$= \tfrac{1}{2}\{(1/\phi_1 - 1/\phi_2)^2 + u_1^2/\phi_1^2 - u_1 u_2 (1/\phi_1^2 + 1/\phi_2^2) + u_2^2/\phi_2^2\}.$$

If $u_1 = -d$, then $u_2 = d$ and $\phi_1 = \phi_2$ so that the first
derivative vanishes. For all $0 < t < 1/2$, the second derivative is
positive (recall $u_1 < 0 < u_2$), so the function is convex. Hence $Q$ is
minimized for $u_1 = -d$, $u_2 = d$. Then $M = 0$, $Q = -P = d^2$ and the
optimal $\Sigma$ becomes

$$\Sigma = \Omega^{-1} = \begin{pmatrix} 2 & 0 \\ 0 & 2d^2 \end{pmatrix},$$

as required.
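The minimization of $Q$ can also be confirmed numerically. The following sketch is ours and only a sanity check of the argument above; it uses Python's standard normal quantile function and a simple grid search over $t$ (the grid resolution is an arbitrary choice).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)


def q_of_t(t):
    """Q in (A.6) as a function of t, with u1 = g(t) and u2 = g(t + 1/2)."""
    u1 = STD_NORMAL.inv_cdf(t)
    u2 = STD_NORMAL.inv_cdf(t + 0.5)
    return 0.25 * (u1 - u2) ** 2


# Grid search over 0 < t < 1/2; the analytic minimizer is t = 1/4.
grid = [k / 1000 for k in range(1, 500)]
t_best = min(grid, key=q_of_t)

d = STD_NORMAL.inv_cdf(0.75)
sigma_mve = [[2.0, 0.0], [0.0, 2 * d * d]]   # constrained population mve matrix
```

The grid minimizer coincides with $t = 1/4$, i.e. $u_1 = -d$ and $u_2 = d$, and the resulting matrix has its larger diagonal entry in the first coordinate, consistent with $e_1$ being the dominant eigenvector.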
Figure 8: Histograms of the 0°, 15°, 30° and 90° projections. Left panels
show the 50% of the data with the smallest variance (the blue lines) and its
location measure (the red lines); right panels show the 50% of the data with
the smallest variance computed around the mean 0. Panel values:
(a) 0°: $v_{trunc} = 0.08$, $\bar{x}_{trunc} = 0.93$;
(b) 0°: $v_{trunc} = 0.54$, $\bar{x} = 0$;
(c) 15°: $v_{trunc} = 0.14$, $\bar{x}_{trunc} = 0.92$;
(d) 15°: $v_{trunc} = 0.42$, $\bar{x} = 0$;
(e) 30°: $v_{trunc} = 0.21$, $\bar{x}_{trunc} = 0.63$;
(f) 30°: $v_{trunc} = 0.26$, $\bar{x}_{trunc} = 0.02$;
(g) 90°: $v_{trunc} = 0.12$, $\bar{x}_{trunc} = 0.18$;
(h) 90°: $v_{trunc} = 0.14$, $\bar{x} = 0$.